FEMPI: A Lightweight Fault-tolerant MPI for Embedded Cluster Systems

Authors

  • Rajagopal Subramaniyan
  • Vikas Aggarwal
  • Adam Jacobs
  • Alan D. George
Abstract

Ever-increasing demands of space missions for data returns from their limited processing and communications resources have made the traditional approach of data gathering, data compression, and data transmission no longer viable. Increasing on-board processing power by providing high-performance computing (HPC) capabilities using commercial-off-the-shelf (COTS) components is a promising approach that significantly increases performance while reducing cost. However, the susceptibility of COTS components to single-event upsets (SEUs) is a concern that demands a fault-tolerant system infrastructure. Among the components of this infrastructure, message-passing middleware based upon the Message Passing Interface (MPI) standard is essential, so as to support and provide a nearly effortless transition for earth and space science applications in MPI from ground-based computational clusters to HPC systems in space. In this paper, we present the design of a fault-tolerant, MPI-compatible middleware for embedded cluster computing known as FEMPI (Fault-tolerant Embedded MPI). We also present preliminary performance results with FEMPI on a COTS-based, embedded cluster system prototype.

Related resources

SHIELD: A Fault-Tolerant MPI for an Infiniband Cluster

Today’s high performance cluster computing technologies demand extreme robustness against unexpected failures to finish aggressively parallelized work in a given time constraint. Although there has been a steady effort in developing hardware and software tools to increase fault-resilience of cluster environments, a successful solution has yet to be delivered to commercial vendors. This paper pr...

Towards Middleware for Fault-Tolerance in Distributed Real-Time and Embedded Systems

Distributed real-time and embedded (DRE) systems often require support for multiple simultaneous quality of service (QoS) properties, such as real-timeliness and fault tolerance, that operate within resource-constrained environments. These resource constraints motivate the need for a lightweight middleware infrastructure, while the need for simultaneous QoS properties requires the middleware to ...

Failure Resilient Heterogeneous Parallel Computing Across Multidomain Clusters

We propose lightweight middleware solutions that facilitate and simplify the execution of failure-resilient MPI programs across multidomain clusters. The system described in this paper leverages H2O, a distributed metacomputing framework, to route MPI message passing across heterogeneous aggregates located in different administrative or network domains. MPI programs instantiate a specially writ...

MPI/RT - An Emerging Standard for High-Performance Real-Time Systems

The last several years saw the emergence of standardization activities for real-time systems, including standardization of operating systems (the series of POSIX standards [1]), of communication for distributed (POSIX.21 [15]) and parallel systems (MPI/RT [6]), and of real-time object management (real-time CORBA [14]). This article describes the ongoing work of real-time message passing interface (MPI/RT)...

Automatic Fault-Tolerant MPI

High performance computing platforms such as Clusters, Grid and Desktop Grids are becoming larger and subject to more frequent failures. MPI is one of the most used message passing libraries in HPC applications. These two trends raise the need for fault-tolerant MPI. The MPICH-V project focuses on designing, implementing and comparing several automatic fault-tolerant protocols for MPI applicati...

Publication date: 2006